
Scale NN distances and sort by distance and index #105

Merged

grovduck merged 9 commits into main from discretize-distances on Jan 31, 2026

Conversation

@grovduck
Member

@grovduck grovduck commented Jan 22, 2026

Due to floating-point differences between operating systems when using np.dot, very close neighbor distances can swap positions, leading to differences in neighbor indexes and, thus, predictions. This PR first scales and discretizes distances so that very similar or exact distances share the same value; tied distances are then sorted by neighbor index.

This change affected the neighbor sorting of four RFNN tests: in the case of very close or tied distances, the neighbor with the smaller index is now chosen as the nearer neighbor. The underlying discrepancy stems from using np.dot to calculate Hamming distances. No other estimators' regression tests were affected by this change.

@aazuspan, let me know what you think of this fix. I went back and forth on surfacing the EPS value as a parameter. For now, it is not configurable, but I don't have a strong feeling as to whether a user would want this flexibility. Also, if you have a sense whether this is an inefficient calculation, please let me know. I didn't notice that tests slowed down much, but there may be some optimization that I'm not aware of.

Lastly, even though I think this should only impact estimators that use Hamming distance, I decided to implement it in RawKNNRegressor.kneighbors, thinking that this defined sorting behavior would be nice for all estimators. The one place I can see this tripping users up is in cross-validation, if one or more samples have identical feature values. The sample with the lower index will still be its own nearest neighbor, whereas the sample with the higher index would now be, at best, the second nearest neighbor. In our experience, co-located plots can sometimes exhibit this behavior.
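For illustration, the scale-and-sort step described above could be sketched like this in plain NumPy (the `EPS` value matches the 1e-10 mentioned in the review below, but the function name and details are illustrative, not sknnr's actual implementation):

```python
import numpy as np

# Illustrative epsilon used for discretization (assumption: 1e-10).
EPS = 1e-10

def deterministic_order(distances, indices):
    """Sort each row of neighbors by discretized distance, then by index.

    distances, indices: (n_queries, n_neighbors) arrays, as returned by
    a kneighbors-style call.
    """
    # Discretize so that near-identical distances compare as equal.
    discretized = np.round(distances / EPS).astype(np.int64)
    # lexsort uses the last key as the primary sort key: distance is
    # primary, neighbor index deterministically breaks ties.
    order = np.lexsort((indices, discretized), axis=1)
    rows = np.arange(distances.shape[0])[:, None]
    return distances[rows, order], indices[rows, order]

# Two neighbors whose distances differ only by floating-point noise
# are reordered by index:
d = np.array([[0.30000000000000004, 0.3]])
i = np.array([[7, 2]])
print(deterministic_order(d, i)[1])  # [[2 7]]
```

Without the discretization step, the 0.30000000000000004 vs. 0.3 comparison would keep neighbor 7 first on some platforms and neighbor 2 first on others.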

PS: Even though we were surprised not to see this bug when working on #101, it did come back to bite us in current work on #99. I'd like to get this fix in place before going back to that PR.

Due to floating-point differences between operating systems when using
`np.dot`, very close neighbor distances can swap position leading to
differences in neighbor indexes and, thus, predictions.  This commit
first scales and discretizes distances such that very similar or exact
distances will share the same value in which case they are subsequently
sorted by neighbor index.
@grovduck grovduck self-assigned this Jan 22, 2026
@grovduck grovduck added the bug Something isn't working label Jan 22, 2026
@grovduck grovduck requested review from aazuspan and Copilot January 22, 2026 17:22
Copilot AI left a comment

Pull request overview

This PR addresses floating-point precision issues in nearest neighbor distance calculations by implementing a deterministic sorting approach. When distances are very similar due to platform-specific floating-point differences in np.dot, the new implementation scales and discretizes distances, then sorts first by scaled distance and then by neighbor index to ensure consistent ordering across platforms.

Changes:

  • Modified RawKNNRegressor.kneighbors to scale distances by epsilon (1e-10) and use lexicographic sorting by distance then index
  • Updated four RFNN regression test files to reflect the new deterministic neighbor ordering

Reviewed changes

Copilot reviewed 1 out of 5 changed files in this pull request and generated 3 comments.

Changed files:

  • src/sknnr/_base.py: Added distance scaling and lexicographic sorting to ensure deterministic neighbor ordering when distances are very close
  • tests/test_regressions/test_mixed_rfnn_forests_target_.npz: Updated expected test outputs for RFNN with new sorting behavior
  • tests/test_regressions/test_mixed_rfnn_forests_reference_.npz: Updated expected test outputs for RFNN reference data with new sorting behavior
  • tests/test_regressions/test_kneighbors_reference_full_randomForest_k5_index_.npz: Updated expected test outputs for kneighbors with new sorting behavior
  • tests/test_regressions/test_kneighbors_reference_full_randomForest_k5_ids_.npz: Updated expected test outputs for kneighbors with dataframe ids and new sorting behavior


@aazuspan
Contributor

@grovduck, I am slightly hesitant about the fix, but I'm more than willing to defer to your experience here if you think this is the right solution. I think most of my reservations are because this seems like we're changing behavior of the library code for the sake of the test code, but maybe that's off-base and it would be valuable more generally to improve reproducibility across platforms?

In any case, I'm also not sure what harm this could cause, aside from a negligible loss in precision in neighbor comparisons (presumably the difference between two neighbors past 10 decimal places is effectively meaningless).

I went back and forth on surfacing the EPS value as a parameter. For now, it is not configurable, but I don't have a strong feeling as to whether a user would want this flexibility.

If we had some kind of configuration API (something like pandas.options) I could see that being a good place for this, but as a parameter, I could see that the benefit of flexibility might be outweighed by the difficulty explaining it.

You could consider adding a parameter to enable/disable it, but I'm not sure whether that's worth it either.

Also, if you have a sense whether this is an inefficient calculation, please let me know. I didn't notice that tests slowed down much, but there may be some optimization that I'm not aware of.

It looks nicely vectorized to me. I was curious about the performance impact and ran some quick tests with 4000 samples, 10 features, and 10 neighbors, and the added code for scaling, rounding, sorting, and re-indexing only increased the kneighbors time from 194 ms to 195.7 ms, so it seems almost negligible.

Lastly, even though I think this should only impact estimators that use Hamming distance, I decided to implement it in RawKNNRegressor.kneighbors, thinking that this defined sorting behavior would be nice for all estimators.

Yeah, I agree that being consistent across estimators makes sense.

The one place that I can see where this would trip users up is in cross-validation if one or more samples have identical feature values. The sample with the lower index will still be its own nearest neighbor, whereas the sample with the higher index would now be at best the second nearest neighbor. In our experience, co-located plots can sometimes exhibit this behavior.

I don't think I'm totally following... Does this mean that you might somehow compare a feature against itself during cross-validation, whereas you'd otherwise get an independent feature? Or is it the difference between which of two identical features you'd identify as the first neighbor?

@grovduck
Member Author

@grovduck, I am slightly hesitant about the fix, but I'm more than willing to defer to your experience here if you think this is the right solution. I think most of my reservations are because this seems like we're changing behavior of the library code for the sake of the test code, but maybe that's off-base and it would be valuable more generally to improve reproducibility across platforms?

@aazuspan, I can completely understand your hesitancy here and I appreciate you making me elaborate on why I think it's needed. Definitely the impetus for introducing the fix was failing tests and I concede that point. But I think the bigger issue is that we need consistency across platforms for neighbor order, which impacts predictions and scores. My thought is that the neighbor assignment needs to be deterministic across all platforms/users and the tests exposed that this wasn't the case.

We're obviously seeing this behavior of tied distances much more in the estimators that use Hamming distance because it's simply a proportion of matched nodes between target and reference (as we've discussed before). If we use equally weighted trees (the default in RFNN and an option for GBNN), the probability of tied distances is even higher.
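To see why ties are so common here, consider a rough sketch (synthetic data for illustration only, not sknnr's code): with equally weighted trees, the Hamming-style distance is one minus the fraction of trees whose terminal node assignments match, so it can only take n_trees + 1 distinct values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees = 100

# Synthetic terminal node assignments per tree for one target sample
# and five reference samples (4 possible nodes per tree).
target = rng.integers(0, 4, size=n_trees)
refs = rng.integers(0, 4, size=(5, n_trees))

# Hamming-style distance: proportion of trees whose nodes differ.
matches = (refs == target).sum(axis=1)
hamming = 1.0 - matches / n_trees

# Every distance is an exact multiple of 1/n_trees, so with many
# reference samples, exact ties between different neighbors are likely.
print(hamming)
```

Because the distances live on a coarse grid of multiples of 1/n_trees, any platform-dependent rounding in the np.dot computation can flip the order of genuinely tied neighbors.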

KNeighborsRegressor itself warns that "Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data."

Looking at this example with two samples that have identical feature values:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3]
])
y = np.array([10, 20, 30])
est = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# Independent neighbors
print(est.kneighbors()[1])
# [[2 1]
#  [0 2]
#  [0 1]]

# Dependent neighbors
print(est.kneighbors(X)[1])
# [[0 2]
#  [1 0]
#  [0 2]] - Not returning itself as first neighbor

I feel what I'm proposing here is relaxing the criterion for what counts as a tied distance, to correct for numerical imprecision.

I think my main argument is that, given the exact same inputs and parameters, estimators should be deterministic. Perhaps what you are thinking is that the platform, package versions, etc. are part of those inputs/parameters, and that estimators are deterministic in that sense. What I'm proposing in this fix is a "broadening" of that determinism to account for some of these differences (and, of course, to make tests pass!). But please push back if you don't think this is the right way to go about it. I just haven't come up with other options.

In any case, I'm also not sure what harm this could cause, aside from a negligible loss in precision in neighbor comparisons (presumably the difference between two neighbors past 10 decimal places is effectively meaningless).

Yes, I think this is what I'd argue as well. Distances past x decimal places are meaningless, so we may as well set neighbors in a way that will be reproducible.

If we had some kind of configuration API (something like pandas.options) I could see that being a good place for this, but as a parameter, I could see that the benefit of flexibility might be outweighed by the difficulty explaining it.

You could consider adding a parameter to enable/disable it, but I'm not sure whether that's worth it either.

Your second thought here is intriguing. That would allow tests to pass and allay my concerns about reproducibility (and I think I'd make the proposed fix the default), but would still allow a user to use the true calculated distances. We'd have to surface the parameter on TransformedKNeighborsRegressor.kneighbors as well, but not terribly disruptive. Let me know if you think that's worth it.

It looks nicely vectorized to me. I was curious about the performance impact and ran some quick tests with 4000 samples, 10 features, and 10 neighbors, and the added code for scaling, rounding, sorting, and re-indexing only increased the kneighbors time from 194 ms to 195.7 ms, so it seems almost negligible.

Thanks for going to the trouble of doing that testing. Good to know that the effect is negligible.

I don't think I'm totally following... Does this mean that you might somehow compare a feature against itself during cross-validation, whereas you'd otherwise get an independent feature? Or is it the difference between which of two identical features you'd identify as the first neighbor?

Sorry, that explanation wasn't totally clear and using cross-validation as a term in my explanation didn't help. But it's the latter clause that could cause issues. I'll try to give a clear example that isn't necessarily related to cross-validation.

In our GNN modeling, we capture dependent neighbors/predictions as well as independent neighbors/predictions. Dependent predictions are when a plot can use itself as a neighbor. As one of our tests to identify plots for screening, we want to know how "deep" in the neighbor stack a plot had to go to self-impute1. In sknnr as currently defined, this is always the first neighbor given that the distance between a plot and itself will always be zero. With the proposed fix, that self-imputation could be as deep as the number of identical features -or- even the number of features that evaluated to the same distance (think tied number of matches in RFNN). So a feature that had an index of 3000 would self-impute at neighbor 3 if indices 1000 and 2000 had the same rounded distance. This wouldn't come up in LOGO cross-validation, but it would still impact this outlier screen.

In our experience, it's very rare to have samples that share all feature values so this wouldn't come up often. There are a couple of other corner cases that come up, but I'm recognizing that my explanation is pretty rambling at this point and it may be easier to talk it through.

Footnotes

  1. This is complicated by the fact that we do this evaluation at pixel scale, with nine pixels per plot, so pixel X values do not necessarily equal plot X values. We then take a mean of these self-imputation ranks across the nine pixels. High values can show areas where a plot footprint is quite heterogeneous and perhaps not suitable for modeling.

@grovduck grovduck changed the title Scale NN distances by epsilon and sort by distance and index Scale NN distances and sort by distance and index Jan 23, 2026
@aazuspan
Contributor

Thanks for the details on the motivation and how you see this fitting into sklearn. I'm convinced that it goes beyond a patch for our testing framework and addresses a real limitation of sklearn, so I'm onboard for the change :)

That would allow tests to pass and allay my concerns about reproducibility (and I think I'd make the proposed fix the default), but would still allow a user to use the true calculated distances. We'd have to surface the parameter on TransformedKNeighborsRegressor.kneighbors as well, but not terribly disruptive. Let me know if you think that's worth it.

That was my thought. I imagine the parameter won't have any effect in most cases, but as a user I think I would feel more comfortable knowing it was there just in case there was something special about my data that caused issues (the only thing I can think of would be a dataset where most of the distances are very small, with a few very large distances that lead to a huge scaling factor and a loss of meaningful precision at the low end, but I'm not sure that would ever arise in reality). Having to pass the parameter through TransformedKNeighborsRegressor is inconvenient, though.

With the proposed fix, that self-imputation could be as deep as the number of identical features -or- even the number of features that evaluated to the same distance (think tied number of matches in RFNN). So a feature that had an index of 3000 would self-impute at neighbor 3 if indices 1000 and 2000 had the same rounded distance.

Thanks, I think I'm mostly understanding now. Just as a thought, what if we broke distance ties based on the absolute "distance" between their indices, so that if plot ID 3000 had identical features to 1000 and 4000, the nearest neighbors would be [3000, 4000, 1000]? I had to tweak the example there since there's a chance to tie with the distance between indices as well, so I suppose we'd need a third tie-breaker with the index itself. That may be too targeted of a fix for such a specific issue and moves further from sklearn behavior, so I don't know if it's a good idea or not.

When ordering neighbors, distances are used as the first screen.  If
distances are considered to be equal and `use_deterministic_ordering` is
`True`, use the absolute neighbor index difference as the second-level
ordering.  This ensures that when `kneighbors` is called with the `X`
used to fit the estimator (i.e. non-independent cross-validation), the
first neighbor returned for each query point will be itself.
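A rough sketch of that three-level ordering (rounded distance, then absolute index difference, then index). The function name and the `decimals` default are illustrative, not sknnr's actual API; it assumes each query row corresponds to a reference index, as in non-independent imputation:

```python
import numpy as np

def order_with_index_difference(distances, indices, query_indices, decimals=10):
    """Order neighbors by (rounded distance, |index - query index|, index)."""
    discretized = np.round(distances, decimals=decimals)
    index_diff = np.abs(indices - query_indices[:, None])
    # lexsort: the last key is primary, so sort by rounded distance,
    # then absolute index difference, then raw index.
    order = np.lexsort((indices, index_diff, discretized), axis=1)
    rows = np.arange(distances.shape[0])[:, None]
    return distances[rows, order], indices[rows, order]

# Query point 3000 ties with identical references 1000 and 4000; the
# index-difference tie-breaker puts the query point itself first:
d = np.array([[0.0, 0.0, 0.0]])
i = np.array([[1000, 3000, 4000]])
print(order_with_index_difference(d, i, np.array([3000]))[1])
# [[3000 4000 1000]]
```

This reproduces the [3000, 4000, 1000] ordering from the example above, with the raw index as the final tie-breaker when two neighbors are equidistant in both distance and index difference.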
@grovduck
Member Author

@aazuspan, I've taken another hack at this to address the three remaining issues:

  1. Make DISTANCE_PRECISION_DECIMALS a class attribute. I settled on this name to define the number of decimals to be used when scaling and rounding distances. Not sure I love the name, so happy if you have other suggestions. This is defined as a class attribute on RawKNNRegressor and I added documentation under the Attributes section. I've also added a test (test_estimators.py::test_kneighbors_precision_decimals) that monkey-patches this value. I've never written a test with monkey-patching before, so please let me know if this isn't the correct way to go about it.
  2. Expose use_deterministic_ordering as a parameter. This is added to both RawKNNRegressor and TransformedKNeighborsRegressor as a pass-through. Again, open to different naming on the parameter. If this value is True, it runs through the deterministic ordering and, if False, it uses the default sorting of sklearn's KNeighborsRegressor.kneighbors. Again, I have a new test (test_estimators.py::test_kneighbors_deterministic_ordering) that tests the effect of setting this flag. Note that this test and the first test are very similar, so we could perhaps collapse them into one. I went back and forth on this and it was easier for me to understand the tests if they were separated.
  3. Add absolute neighbor index difference as second-level sort. I liked your idea of adding this as a secondary sort and now have this implemented. I understand that this is serving a very narrow use case, but I think it's a nice property to have when doing non-independent imputation - all reference query points should have themselves as their first neighbor. Another test for this one (test_estimators.py::test_kneighbors_uses_index_difference) that shows the effect of passing two duplicate query points which differ in their neighbors based on their row index. Note that this modification once again changed some of the regression results, so I've pushed new versions of those here.
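As a toy illustration of what a decimals attribute like this controls (plain NumPy, not sknnr's code): the chosen number of decimals determines whether two nearly equal distances are treated as tied and thus fall through to the index-based tie-breakers.

```python
import numpy as np

# Two distances that differ only in the 11th decimal place.
a, b = 0.12345678901, 0.12345678902

# Rounded to 10 decimals they compare equal (a tie, to be broken by
# index); rounded to 12 decimals they remain distinct.
assert np.round(a, decimals=10) == np.round(b, decimals=10)
assert np.round(a, decimals=12) != np.round(b, decimals=12)
print("10 decimals: tied; 12 decimals: distinct")
```

Raising the attribute tightens the tie criterion toward sklearn's default behavior; lowering it treats more near-equal distances as ties.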

Thanks for your continued patience.

Contributor

@aazuspan aazuspan left a comment

@grovduck, this looks great! I agree with all your decisions and naming choices, and the new tests are very thorough.

The only change I'd propose is that it might be worth adding a new subheading to the Usage guide to explain how the deterministic ordering works and where it might differ from the default sklearn behavior, since that seems like a notable new feature. The parameter docstring could then repeat some of that information and/or link to the usage guide for more info.

@grovduck
Member Author

The only change I'd propose is that it might be worth adding a new subheading to the Usage guide to explain how the deterministic ordering works and where it might differ from the default sklearn behavior, since that seems like a notable new feature. The parameter docstring could then repeat some of that information and/or link to the usage guide for more info.

This is a great idea. I've now added a section to the user's guide and linked from the docstrings.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 7 changed files in this pull request and generated 6 comments.



Contributor

@aazuspan aazuspan left a comment

@grovduck, the expanded docs look great! I had a few comments for you to consider, but I'm happy whenever you're ready to merge.

@grovduck
Member Author

@aazuspan, thanks for the final review. I think I've addressed your helpful comments and will merge when you've had a chance to quickly review.

Contributor

@aazuspan aazuspan left a comment

LGTM! This was a sticky issue, so it's great to have it solved :)

@grovduck grovduck merged commit 4874ba7 into main Jan 31, 2026
11 checks passed
@grovduck grovduck deleted the discretize-distances branch January 31, 2026 00:58